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Abstract 



The Equi-Energy Sampler (EES) introduced by iKou et ah |2006j is based on a popula 
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,2 ■ tion of chains which are updated by local moves and global moves, also called equi-energy 

^ , ' jumps. The state space is partitioned into energy rings, and the current state of a chain can 

jump to a past state of an adjacent chain that has an energy level close to its level. This 
algorithm has been developed to facilitate global moves between different chains, resulting 
in a good exploration of the state space by the target chain. This method seems to be more 
efficient than the classical Parallel Tempering (PT) algorithm. However it is difficult to use 
. in combination with a Gibbs sampler and it necessitates increased storage. In this paper we 

1 propose an adaptation of this EES that combines PT with the principle of swapping between 

chains with same levels of energy. This adaptation, that we shall call Parallel Tempering 
with Equi-Energy Moves (PTEEM), keeps the original idea of the EES method while ensur- 
ing good theoretical properties, and practical implementation. Performances of the PTEEM 
algorithm are compared with those of the EES and of the standard PT algorithms in the 
context of mixture models, and in a problem of identification of transcription factor binding 
motifs. 

Keywords: equi-energy sampler, parallel tempering, population-based Monte Carlo Markov Chains, 
algorithm convergence, mixture models, binding sites for transcription factors. 



1 Introduction 



A common problem in Bayesian statistics is that of generating random variables from a target 
density tt. Many solutions have been proposed in the last two deca des, deriving essentially from 
the Monte Car lo Markov Chains (MCMC) approach introduced by Metropolis et al.1 |l953l | and 
Hastings 1970| . In classical MCMC methods, a Markov process is built to sample the target 
probability distribution. But in practice, the Markov process can be e asily trapped into a loc al 
mode from where it cannot escape in reasonable time (see for instance Ibian. and Wo^ » 
Many techniques have been propose d to address this waiting time problem , inclu ding among 
others Parallel Tempering (PT) (see iGey er 199 l|| or iGeyer and Thompson! [1995]), and more 
recently the Equi-Energy Sampler (EES) (jKou et al.l 2006j |). 
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In the PT algorithm, N temperatures are introduced, and N chains are run in parallel, 
with target distributions being tempered distributions of the target tt. Note that the first chain 
targets tt. Since the tempered distributions becomes flatter as the temperature increases, the 
chains at high temperatures can move easily between modes. Each iteration of the PT algorithm 
is decomposed into two types of moves: local moves via classical MCMC algorithms to update 
the different chains, and global moves allowing swaps between two chains. The use of these swaps 
enables new modes to be propagated through the different chains, thereby improving mixing. 
The first chain associated with the target distribution will then be able to escape from local 
modes. Some improve ments of PT have been proposed, like swaps with delayed rejection (see 



Green and Miral [20011 ] ) which per mit to propose a new swa p when the first one is not accepted, 



or like Evolutionary Monte Carlo ( Liang and Wong 200l| ). However this PT algorithm does 
not retain information of where chains have been and it does not choose one of the best swaps. 
This is what is done by the EES proposed by Kou et al. 2006| , by using a partition of the state 



sp ace along the energy fu nctio n. Note that such a pa rtitioning has already been recommended 
by iMitsutake et al.l |2003| and lAtchade and Liul |2006l |. in an importance sampling framework. 

In the EES algorithm, the target density is rewritten in terms of energy function, and K 
temperatures and energy levels are introduced. Then a population of K distributions is consid- 
ered, each one being a tempered distribution of tt truncated by an energy level. This algorithm is 
mainly based on a new type of move called the equi-energy jump, that aims to explore the state 
space by moving directly between states with similar energy. The goal is still to improve mixing 
of the chains. However, to perform these moves, the sampler uses past states of the different 
chains. All these past states should then be kept in memory. A substantial advantage of this 
algo rithm is that it s eems to be very efficient compared to classical MCMC methods and to PT 
(see Kou et al. 2006| ). But an associated drawback is the cost of increased storage, all the past 



being taken into account in equi-energy jumps. In addition some difficulties are encountered to 
combine EES with a Gibbs sampler. The problem is to sample from the tempered distributions 
truncated by energy levels. Some algorithms could be used to sample from it, like accept-reject or 
Approximate Bayesian Computation algorithms, but the computational cost would then be too 
high in practice. From a theoretical point of view, the EES is not based on a Markov chain, and 
its theoretical analysis is relatively difficult. Several authors studied it s convergence under var - 
ious assumptions. Th e proof of the convergen ce h as been discussed in lAtchade and Liul [2006] , 



Andrieu et al.l [2007aj], lAndrieu et al.l |2007bl ] and lAndrieu et all [2008fl. iHua and Koul (2011 1 



completed the proof of the convergence of the EES in the case of a countable st a te spa ce, and 
rece ntly more gener al convergence results has been established by Atchade et al. 2011a| . Note 
that Atchade |2O10l | showed that the asymptotic variances of adaptive MCMC algorithms (and 
hence the EES) are always at least as large as the asymptotic variances of MCMC algorithms 
with the same target distributions, and that the differences can be substantial. 

In this paper we develop an adaptation of the PT and EES algorithms, called the Parallel 
Tempering with Equi-Energy Moves (PTEEM) algorithm. An equi-energy exchange move is 
proposed, based on the energies of current states of the chains, and not on past states. Compared 
to PT algorithm, only moves between chains whose states are close in energy are proposed. This 
focuses computational effort on moves which are likely to be accepted, and hence which allow 
jumps between modes. This PTEEM algorithm can be easily combined with a Gibbs sampler, 
and its convergence is ensured. Furthermore, it does not need a large storage. The possible loss 
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or gain of this algorithm compared to EES and PT are evaluated through simulations and real 
data. 

The paper is organized as follows: In Section 2, PT and EES algorithms are briefly recalled. 
In Section 3 the PTEEM algorithm is presented. In Sections 4 and 5, performances of the 
PTEEM algorithm are compared with those of the EES and of the standard PT algorithms in 
the context of mixture models, through simulations and real data. In Section 6, PTEEM and 
EES algorithms are compared in a challenging problem of identification of transcription factor 
binding motifs. Section 7 presents concluding remarks. 



2 Background on PT and EES algorithms 
2.1 PT algorithm 

In case of complex or high dimensional problems whose densities of interest contain several modes, 
classical MCMC methods (like Metropolis-Hastings algorithm or Gibbs sampler for instance) are 
often trapped into local modes from where they cannot escape in reasonable time. To avoid this 
problem, the principle of PT is to choose N temperatures T\ = 1 < T% < • ■ ■ < Tjv, and to run in 
parallel N associated MCMC chains having different stationary distributions, 7Ti, • • • ,ttn, where 

7Tj OC T\ l l Tl . 



The higher the temperature is, the easier the exploration of the state space is for the associated 
chain. Each iteration of the PT algorithm is decomposed into local and global moves. During 
local moves, each chain is updated independently of others. In particular, the ith chain is updated 
using a classical MCMC algorithm with stationary distribution 7Tj. For a global move, two chains 
i and j are randomly chosen and a swap of their current states is proposed, and accepted with 
the following Metropolis-Hastngs ratio: 

mm< I, J J > , 

where Xi stands for the current state of the ith chain. 



2.2 EES algorithm 



To use the EES algorithm introduced bv lKou et al.l |2006j], two sequences of K temperatures and 



K+l energy levels should be chosen: H\ < H2 < • • • < Hk+i = 00 and T\ = 1 < T2 < ■ ■ • < Tk, 
where H\ < mm(h(x)) and ir(x) oc exp(— h(x)). A population of K distributions with the 
following densities is considered: 

max{h(x) , Hi} 

TTi(x) cx exp{— hi{x)\, where hi{x) = — . 

The main difference with the PT algorithm being the energy truncation. This energy truncation 
is used to flatten the distributions for easier exploration. This method uses energy rings for each 
chain, an energy ring containing past states of the chain of similar energy levels. The algorithm 
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begins by sampling the i'Tth chain from a Metropolis-Hastings kernel with stationary distribution 
7Tr-. Once convergence is reached, generated samples are stored in the energy rings of this Kth 
chain, and the next chain targeting ttk—1 starts. This (K — l)th chain will be updated by either 
(with probability p ee ) using a Metropolis-Hastings kernel with stationary distribution ttk-1, or 
by proposing to replace the current state of the chain by a past state of the previous chain 
of similar energy level. This move corresponds to the equi-energy jump, and is based on the 
energy rings of the previous chain. Once convergence is reached, generated samples are stored 
in the energy rings of this (K — l)th chain, and the next chain targeting ttk-2 starts. The EES 
algorithm successively steps down the energy and temperature ladder until the target distribution 
7Ti = 7r is reached. Each chain i, with i < K, is updated by either a Metropolis-Hastings kernel 
with stationary distribution 7Tj or by an equi-energy jump. More precisely, an equi-energy jump 
between two successive chains i and i — 1 is the following: a state y is chosen from the chain i 
such that h(y) and h(x^i) belong to energy rings of similar energy. Then y is accepted to be 
the next state of the (i — l)th chain with probability 

7Ti_i(j/)7riO;i_i 



min < 1 



L 5 ~ 



7rj_iOi_i)7ri(y) 

3 PTEEM algorithm 

3.1 Description of the algorithm 

We introduce a sequence of d+1 energy levels H\ < H2 < ■ ■ ■ < i^rf+i = 00 with H\ < min(/i(x)), 
and a sequence of N temperatures T± = 1 < T2 < ■ ■ ■ < T/y. The algorithm considers a 
population of N chains associated with probability measures iTi(x) oc ^(x) 1 /^, each 7Tj being a 
density with respect to a probability measure A on (X , B(X)), where X is a countably generated 
state space which coincides with the support of the 7Tj, and B{X) stands for the associated Borel 
cr-algebra. 

Energy rings Dj,j = l,...,d are constructed as follows: the state space X is partitioned 
according to the energy levels: X = Uj=i ^31 where 

Dj = {x G X; h(x) E [H jt H j+1 )}, j = 2, ■ ■ ■ , d 
Di = {x G X;h(x) £ (-00, H 2 )}. 

Compared to the energy rings of the EES method, these rings contain only current states, and 
there is only on e sequence of energy rings for all the chains. By contrast the rings defined by 



Kou et al.l [2006l | contain past states, and a sequence of energy rings is constructed for each chain. 

Each step of the PTEEM algorithm is decomposed into two types of moves: local moves via 
classical MCMC algorithms and global moves allowing an exchange between two chains with 
similar energies. 

Local moves Each chain is locally updated, independently of others. In particular, the ith 
chain is updated using one iteration of a classical MCMC algorithm with stationary distribution 
TTj. This algorithm could b e a Metropolis-Hastings algorith m, a Gibbs sampler, an hybrid MCMC 



|l997l|b 



Robert and Casellal [2004|), or a Reversible Jump MCMC (jGreenl [1995l |. lRichardson and Green 
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Global moves At each step, an energy ring Dj containing at least two chains is chosen ran- 
domly. Two chains are then chosen uniformly in Dj, say the ith and the kth ones (with i < k), 
and an exchange move is proposed between the current two states of these chains. The move is 
from s = (xi, ■■■ ,Xi, ■ ■ ■ ,x k ,--- ,x N ) to s' = (x 1 , ■■■ ,x k ,--- ,x N ). 

The product c-algebra is written B{X)^ ', and the product measure is denoted by Ajy. The 
probability measure ir* is defined as follows: 

N 

ir*(dxi, dx2, ■ ■ ■ , cLxn) = W_^i{xi)\(dxi) on (X N , B{X) N ) 

i=i 

The probability acceptance for the global move is then given by: 



P (s;s) = minjl,— j 



mm 



1 ■K i {x k )-K k (Xj) \ , 
' ■K i (x i )'K k {x k ) J ' 



Note that if the denominator is null, then the numerator is also null and by convention p(s; s') 
is null. The chains are not Markov by themselves, it is the whole stochastic process made of the 
iV chains together that forms a Markov chain on (X N ,B(X) N ). 

Remark 3.1 It is of interest to compare the total number of local and global moves required in 
PTEEM and EES algorithms. Let us denote by B the size of the burn-in period, by R the number 
of iterations necessary to initialize energy rings within EES, and by M the sample size required 
for the chain of interest (after the burn-in period). We have: 

• For EES, the total number of local moves is equal to 

K(B + R) + (M - R) + (1 - p e e) ( ( ^~ 1)A (B + R) + (K- 1)(M - R) 
and the total number of proposed global moves is equal to 

Pee ( ^ K ~ 2 1)K (B + R) + (K- 1)(M - R) 

where K denotes the number of chains in EES. 

• For PTEEM, the total number of local moves is NM + NB, 
and the total number of global moves is M + B, 

where N stands for the number of chains in PTEEM. 

In terms of computational cost, we should take into account that in some problems the local 
algorithms used by EES and PTEEM can be different (see Section^, and thus can have different 
computational costs. In terms of storage, to obtain the {i + \)th iteration of the target chain, EES 

uses KR + i + (1 — p ee ) ^- ^ — — + {K — \ values in memory to choose an element in an 

energy ring, whereas PTEEM necessitates only N values. Notice that CPU time to compute one 
iteration increases within EES as the simulations go along, while it is constant within PTEEM 
algorithm. 



■5 



3.2 Some theoretical results 



In this section standard sufficient conditions ensuring convergence of the PTEEM algorithm are 
given. Denote by S the Markov chain on {X , B(X) ) obtained by the PTEEM algorithm, a 
state of S is written s. The transition kernel associated with an iteration of PTEEM is written P, 
and P k is the fc-step transition kernel. They are defined on (X N xB(X) N ) 2 . The transition kernel 
associated with the local move of the ith chain is written PLi, and is defined on (X x B(X )) 2 . The 
transition kernel associated with the whole N local moves of an iteration of PTEEM is written 
PL, and is defined on (X N x B(X) N ) 2 . The transition kernel associated with the equi-energy 
move is written PE, and is defined on (X N x B(X) N ) 2 . Writing 



S — (xi , . . . , Xj , . . . , Xfc i ■ ■ ■ i 3?tv) 
(Xi, . . . , Xj, . . . , Xfc, • • • , Xjy\ 



N 



we have 

PL(s,s') = ~[PL i (x i ,x' i ), 

i=l 

P(s,s') = (PE*PL)(s,s')= [ PE(s,s')PL(s,s)ds. 

Jx N 

Write q(s, s') the auxiliary distribution to propose s' from s in an equi-energy move, and qt(xi, x'^) 
the auxiliary distribution to propose x\ from Xj in a local move of the ith chain. The total 
variation norm for a measure fi on (X N , B(X^) is defined by: 

\\h\\tv = sup \fi(A)\. 

A£B(X^) 

Proposition 3.1 // the transition kernels associated with the local moves are reversible with 
stationary distributions m, i = 1, ■ ■ ■ , N, aperiodic and strongly X-irreducible, then the chain S 
is reversible, strongly \jy -irreducible and we have for tt* -almost all s G X 

lim ||P n (s,.) -vr*|| TV = 0. 

n— >oo 

Therefore ir* is the stationary distribution of S and the chain associated with T\ = 1 provides 
samples corresponding to tti = tt, which is the target distribution. 

Proof. See Appendix lB.il 

Remark 3.2 In Proposition \3.1\ the transition kernels of the local moves are assumed to be 
aperiodic. We can relax this hypothesis. In fact, it is sufficient that only one of the N transition 
kernel is aperiodic to have P aperiodic. 

The reversibility hypothesis can also be relaxed. If the reversibility of the local transition kernels is 
not assumed, the convergence results remained, but the reversibility of the cha in S is no more en 



sured. This reversibility can be interesting in order to use limit theorems (see \Robert and Casella 
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This proposition has minimal assumptions, which are usually not difficult to verify, especially 
for classical MCMC algorithms like Metropolis-Hastings algorithms or Gibbs samplers. However, 
it is possible to have a null set of states from which convergence does not occur. The following 
lemma and proposition have stronger assumptions that ensure convergence from all starting 
points. 

Lemma 3.1 Assume that the transition kernels associated with the local moves are reversible 
with stationary distributions m, i = 1, • • • , N, aperiodic and strongly X-irreducible, and assume 
the strict positivity of the density it* on X (Vs G X n ,tt*(s) > 0). Then the chain S is reversible, 
strongly An -irreducible, positive and Harris-recurrent. 

Proof. See Appendix IB. 21 

The following proposition is a consequence of Lemma 13.11 

Proposition 3.2 Assume that the transition kernels associated with the local moves are re- 
versible with stationary distributions 7Tj, i = 1, ••• ,N, aperiodic and strongly X-irreducible, and 
assume the strict positivity of the density tt* on X N (Vs G X , vr*(s) > 0). Then we have for all 

seX N 

lim ||P n (s,.) -7T*|| T y = 0. 

n— >oo 



Proof: Using Lemma 13.11 and Proposition 13.11 S is a Markov chain 7r*-irreducible, aperiodic, 
with stationna ry distribution it* and Harris-recurrent. The result follows from Theorem 1 of 
Tiernevl |l994l |. □ 



3.3 Calibration 

Ideally, the more chains and energy rings are used, the better the result of the algorithm will 
be. However, it is not always possible in practice for computational reasons. That is why we 
suggest from our experience simple ways to choose the number of chains and energy rings, and 
to calibrate the energy ladder and the temperatures. 

Number of rings The number of energy rings d should be chosen in relation with the com- 
plexity of the target density. For instance, if the amplitude between the larger and the lower 
energy levels is large, or if the different modes are associated with different energy levels, then it 
is necessary to increase the number of rings. 

Energy ladder The levels Hi,H2, ■ ■ ■ ,Hd determine the energy rings. The first energy ring 
includes states having an energy level lower than H2 , and ideally only few states having an energy 
level lower than H\. The last energy ring includes states having an energy value higher than 
Hd- To choose H\ and we use one or few runs of a classical MCMC algorithm with target 
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density ir. We take for the energy associated with a state with high enough finite energy 

compared to other states. Concerning Hi, we take the energy corresponding to an observed 

mode. In practice, we can take for H& the energy associated with a state after few iterations of 

the algorithm, and for Hi the energy associated with a state after a burn-in period. 

Once the values Hi and H& are chosen, the ln(flj) or the ln(i7j + i — Hi) can be set to be evenly 

spaced. 

Remark 3.3 Concerning Hi, if the modes of the distribution of interest are known, we just have 
to take Hi slightly lower than the energy of the highest mode. 



Number of chains If the number of chains N is chosen too small, the chance of having samples 
from different chains in the same ring will be too small. From our experience, N should be at 
least equal to 3d, and choosing it between 3d and 5d is usually satisfactory. However, one can 
always use more chains if the computation time is not a problem. 



Temperatures The distribution associated with the highest temperature should be sufficiently 
flattened so that the associated chain can move freely from one mode to another. After choosing 
a Tjy value we just have to check that the associated chain moves easily. T\ is obviously equal 
to 1, and is associated with the chain of interest. Once T\ and Tjv are fixed, the other tem- 
peratures can be chosen by evenly spacing them on a logarithmic scale, by ev enly spacing thei r 
inverses, or by evenly spacin g t heir inverse s geometrical ly (see for instance iKou et al.l [20061 ] . 



Nagata and Watanabd [20081 1 or iNeall [1996]). Following lAtchade et al.l |2011bl ] . we can try to 



adjust the temperatures so that the proportion of accepted equi-energy moves is approximately 
0.234. 

We tried simple ways to choose the energy levels in combination with the temperatures, but 
none of them gave conclusive results. However, if the expression of the target density is known, 
it could be possible to choose theoretically the temperatures and the energy levels so that each 
ring contains in mean the same number of chains. 



Checking that the choices of temperatures and energy ladder are relevant It is 

necessary to check on a run of PTEEM that the choices of temperatures and energy ladder are 
relevant. The chain 1 should have almost all its states in the first energy ring, and the last chain 
should have almost all its states in the last energy ring. For the other chains, the distribution in 
the rings can be considered as correct if there is no "energy gap" between adjacent chains, and 
if each chain performed equi-energy moves with chains having higher and lower temperatures. If 
this is not the case, poor mixing is observed between chains, and it is then necessary to adjust 
the temperatures or the energy levels, adding new temperatures for instance or proposing a new 
calibration. This problem of calibration is illustrated in Table [TJ 



4 Example of simulations using local Metropolis-Hastings moves 

To compare the three algorithms (PT, EES and PTEEM) when the local move is a Metropolis- 
Hastings algorithm, we consider sampling from a two-dimensional normal mixture model taken 
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Bad repartition 


Good repartition 


Energy ring 


1 


2 


3 


4 


5 


1 


2 


3 


4 


5 


chain i — 2 


990 


10 











990 


10 











chain % — 1 


950 


50 











701 


202 


97 








chain i 


900 


100 











387 


408 


205 








chain i + 1 





2 


237 


511 


250 


45 


312 


355 


288 





chain i + 2 








105 


610 


285 





64 


517 


353 


66 



Table 1: Illustration for bad and good repartitions of the states in the energy rings. There is an 
energy gap between chains i and i + 1 in the bad repartition case. 



from Liang and Wong 2001 1 and used as an illustration by Kou et al. 2006| . Let 

1 



20 



m = £ 



W; 



exp 



2 (x - Hi)'(x - Hi)), 



where (J\ = 
(/*!,-■ 



(720 = 0.1, Wi 



W 2 



2a, 



0.05, and the 20 mean vectors 



, M20J 



6.87 
5.40 



2.18 8.67 4.24 8.41 3.93 3.25 1.70 4.59 6.91 
5.76 9.59 8.48 1.68 8.82 3.47 0.50 5.60 5.81 

5.41 2.70 4.98 1.14 8.33 4.93 1.83 2.26 5.54 1.69 
2.65 7.88 3.70 2.39 9.50 1.50 0.09 0.31 6.86 8.11 



The different local modes are quite far from each other (most of them are more than 15 stan- 
dard deviations from the nearest ones), hence this mixture distribution is quite challenging for 
sampling. In addition, the initial states of the different chains were drawn from a uniform dis- 
tribution on [0, l] 2 , a region far from the local modes. 

Each algorithm was run 100 times. For each run, the PT and PTEEM algorithms were run for 
2500 iterations after a burn-in period of 2500 iterations. Similarly, for each chain of the EES the 
burn-in period was of 2500 iterations, and for the first chain (the target chain) 2500 iterations 
were simulated a fter this bu r n-in p eriod and the period to construct the rings, which was of 500 
iterations. As in iKou et al.l [2006], the Metropolis-Hastings proposal was a bivariate Gaussian 



Af 2 (X$,T?I 2 ), with n = 0.25^/Ti. Unlike them, the step size n was not tuned later in 



the algorithms such that the acceptance ratio is in the range (0.22,0.32). Indeed, we would like 
to compare algorithms as simple as possible. For the EES, we took the same number of chains, 
the same energy levels, the same temperatures and the same equi-energy jump probability than 



Kou et al.1 [2006l | (K = 5, H = (0.2,2,6.3,20,63.2), T = (1,2.8,7.7,21.6,60), p ee = 0.1). For 
the PT and PTEEM algorithms, N = 20 chai ns were taken, w ith temperatures between 1 and 
60 evenly spaced on a logarithmic scale. As in IKou et al.l |2006| . the PT algorithm used a swap 
between neighboring temperature chains for the exchange operation, but only one swap was pro- 
posed at each iteration, to make it comparable with the PTEEM. For the PTEEM, the same 5 
groups of energy than for the EES were taken. 

Mean acceptance rates for the local Metropolis-Hastings moves and for th e exchange moves 
between chains for the three algorithms are given in Table [2j In comparison Kou et al. |2006| 
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obtained results slightly different probably because the step size Tj was tuned in their EES. 

To compare the ability of each algorithm to explore the distribution space, we consid- 





Local moves 


Exchange moves 


EES 


0.387 


0.799 


PT 


0.337 


0.905 


PTEEM 


0.333 


0.822 



Table 2: Mean acceptance rates for local moves and exchange moves on 100 runs, for EES, PT 
and PTEEM algorithms. 



ered for each run of each algorithm the number and frequency of visited modes by the target 
chain, as well as the estimations of the mean vector {E(X\),E(X2)) and of the second moments 
(E(Xf), E^X^)) using the samples generated from the target chain. Table [3] (A) contains these 
estimations. Concerning the estimations of the mean vector and of the second moments, the EES 
and PTEEM estimates were more accurate than those of the PT, with smaller mean squared 
errors. Moreover, it appeared that the PTEEM estimates were slightly more accurate than those 
of the EES. Concerning the number of visited modes, good results were obtained by the EES 



True value 


E{X 1 ) 
4.478 


E{X 2 ) 
4.905 


E{X x f 
25.605 


E{x 2 y 

33.920 


(A) 


EES 
PT 
PTEEM 


4.448 (0.301) 
3.971 (0.809) 
4.483 (0.324) 


4.953 (0.458) 
4.137 (1.114) 
4.912 (0.454) 


25.229 (3.112) 
21.510 (7.741) 
25.556 (3.366) 


34.226 (4.507) 
27.510 (10.407) 
33.889 (4.406) 


(B) 


EES 
PTEEM 


5.088 (0.373) 
4.745 (0.365) 


6.001 (0.515) 
5.468 (0.491) 


32.005 (4.086) 
28.908 (3.774) 


45.306 (5.638) 
40.617 (5.019) 



Table 3: Estimations of the mean vector (E(Xi), E(X 2 )) and of the second moments 
(E(Xf), E^X^)) using the samples generated from the target chain, obtained on 100 runs for 
EES, PT and PTEEM algorithms. The standard deviations are given between parentheses. (A) 
corresponds to the case with equal variances, and (B) to the case with unequal variances. 



and PTEEM algorithms compared to the PT. The results are reported in Table 01 The mean 
number of visited modes by the PT on t he 100 run s was 14.31, compared to 19.92 for the EES 
and 19.98 for the PTEEM. Then, as in iKou et al.l |2006j], we counted in each of the 100 runs 



PT 


EES 


PTEEM 


2 to 10 missed. 
A mean of 5.69 missed. 


1 missed for 4 runs. 

2 missed for 2 runs. 


1 missed for 2 runs. 



Table 4: Number of missed modes by the 100 runs for EES, PT and PTEEM algorithms. 



for the three algorithms how many times the target chain visited each mode in the last 2500 
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iterations. The absolute frequency error is given by err^ = |/, — 0.05|, where /j is the sample 
frequency of the ith mode being visited (i = 1, . . . , 20). The median and the maximum of errj 
over the 100 runs was calculated. To compare the three algorithms the ratios of these values 
between PT and EES, between PT and PTEEM and between EES and PTEEM were calculated 



for each mode. All these ratios are presented in Table As denoted in lKou et al.l [20061 ] . EES 
seemed to be more efficient than PT: the mean of the ratios R m ed{PT/EES) over the 20 modes 
was 2.42, and the mean of the ratios R ma x(PT/EES) over the 20 modes was 2.92. As expected, 
PTEEM gave better results than PT: the mean of R m ed(PT I pteem) was 2.52, and the mean of 
Rmax(PT / pteem) was 3.07. Besides, we noticed a slight improvement of PTEEM compared to 
EES: 1.05 for the mean of R med (EES/ PTEEM), and 113 for the mean of R ma x(EES/ PTEEM)- 





Mi 


M2 


M3 


M4 


M5 


Me 


M7 


Ms 


M9 


Mio 


Rmed PT/EES 


2.16 


2.80 


2.92 


2.22 


1.98 


2.21 


3.10 


2.07 


2.07 


2.69 


Rrnax PT/EES 


3.59 


2.61 


2.81 


2.10 


1.55 


2.43 


2.54 


1.53 


2.93 


4.50 


R med PT/PTEEM 


2.60 


3.72 


2.63 


2.19 


1.79 


2.97 


2.77 


2.55 


2.32 


2.64 


Rrnax PT/PTEEM 


3.44 


1.76 


2.27 


2.30 


2.44 


3.06 


5.23 


2.92 


2.83 


5.23 


R med EES/PTEEM 


1.21 


1.33 


0.90 


0.99 


0.91 


1.35 


0.89 


1.23 


1.12 


0.98 


R max EES/PTEEM 


0.96 


0.67 


0.81 


1.09 


1.58 


1.26 


2.06 


1.91 


0.96 


1.16 




Mn 


Ml2 


Ml3 


Ml4 


Ml5 


Ml6 


Ml7 


Mis 


Ml9 


M20 


R, ned PT/EES 


2.51 


2.46 


2.77 


2.63 


2.39 


1.76 


3.06 


2.22 


2.10 


2.37 


Rmax PT/EES 


4.58 


1.60 


3.23 


4.61 


3.26 


2.10 


2.83 


4.77 


3.50 


1.36 


Rmed PT/PTEEM 


2.14 


1.98 


1.79 


2.84 


2.75 


2.18 


2.72 


2.78 


2.43 


2.60 


Rmax PT/PTEEM 


3.05 


2.02 


2.35 


4.16 


3.44 


1.79 


3.72 


3.78 


3.50 


2.16 


R med EES/PTEEM 


0.85 


0.81 


0.65 


1.08 


1.15 


1.24 


0.89 


1.25 


1.16 


1.10 


R max EES/PTEEM 


0.67 


1.26 


0.73 


0.90 


1.06 


0.85 


1.32 


0.79 


1.00 


1.58 



Table 5: For each mode, ratios of median {R m ed) an d ratios of maximum (R m ax) are for PT over 
EES, PT over PTEEM, and EES over PTEEM. Each ratio is obtained on 100 runs. 

Figures Q] and [2] show the last 2500 iterations after burn-in for the chains 1, 7, 14 and 20 
obtained by one run of the PT algorithm, and by one run of the PTEEM algorithm. Figure [3] 
shows the simulations after a burn-in period for chains 1 to 5 obtained by a run of EES. The 
first chains of the PTEEM and EES visited all the modes of the target density whereas the first 
chain of PT did not visit all of them. Notice that chains with the highest temperatures of the PT 
algorithm visited all the modes, and these chains for the EES kept in memory lots of iterations. 

Table [6] presents the repartition of accepted equi-energy moves for chains 1, 10 and 20, 
with other possible chains within a run of the PTEEM algorithm. As expected, the closer the 
temperatures of chains were, the more often the equi-energy moves were accepted. Note that 
equi-energy moves had been proposed and accepted for all possible pairs of chains, including for 
pairs of c hains with very d ifferent temperatures. 



As in iKou et al.1 [20061 ] , it appeared that the EES algorithm gave better results than the 
classical PT. Besides the PTEEM algorithm gave results comparable to those of the EES, and 
even slightly better. In this example all the modes have exactly the same energy and the same 
component variance. 
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Simulations of chain 1 (T=1) 



Simulations of chain 7 (T=3.G4) 




Case of unequal variances and energy levels 

In order to study the behavior of the PTEEM and the EES algorithms in case of unequal variances 
and energy levels, we took <j\ = . . . = 05 = 0.4, erg = ... = o\% = 0.1 and (T14 = . . . = 020 = 0.05. 
The modes associated with small variances have lower energy levels, and those associated with 
large variances have higher energy levels. 

The PTEEM and EES algorithm was run 100 times. For each run, the PT and PTEEM algo- 
rithms were run with the same number of iterations and the same Metropolis-Hastings proposal 
as previously. For the PTEEM, we still used 20 chains, but with six energy levels: we took 
H\ = 0.5 and H2 to Hq evenly spaced on a logarithmic scale between 1.5 and 20. The tempera- 
tures were still evenly spaced on a logarithmic scale between 1 and 60. For the EES, six chains 
were used with the same energy levels than for the PTEEM, the temperatures were also evenly 
spaced on a logarithmic scale between 1 and 60, and p ee = 0.1. It is interesting, because the 20 
modes were divided into three energy rings, and the second energy ring contained modes with 
different variances (0.4 2 and 0.1 2 ). 

Concerning the estimations of the mean vector (E(Xi), E(X2)) and of the second moments 
(E(X%),E(X%)), Table [3] (B) shows that the PTEEM estimates were more accurate than those 
of the EES, with smaller mean squared errors. The mean number of visited modes by the PTEEM 
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Simulations of cha in 1 {T=1 } Sim ulatlo ns of chal n 7 (T= 3.64) 




Simulations of oha in 1 4 (T=1 6 .46) Si mulations of chain ZD (T=60) 




2 4 6 6 10 02468 10 

XI X1 



Figure 2: Simulations for chains 1, 7, 14 and 20 obtained by one run of the PTEEM algorithm. 
The colors correspond to the five energy levels. 



on the 100 runs was 18.91, compared to 17.83 for the EES. Concerning the absolute frequency 
errors, we noticed an improvement of PTEEM compared to EES, as the mean of the ratios 
Rmed(EES/ pteem) over the 20 modes was 1.213, and the mean of the ratios R m ax(EES /pteem) 
was 1.203. 

Note that if some components are associated to very small variances (a = 0.01 for instance), it 
became difficult for both the PTEEM and the EES algorithms to detect these modes when they 
are isolated, far from other "larger" modes. 

Remark 4.1 In this last example, the EES algorithm necessitates 72 250 local moves and 5750 
global moves in mean, while the PTEEM algorithm necessitates 100 000 local moves and 5000 
global moves. The goal was to study these two algorithms when they use the same number of energy 
rings (6 here) and with the same number of iterations after burn-in and rings construction for 
the chain of interest. Using 20 chains for the PTEEM matches the advices given in section 13.31 
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3000 iimus, chain 1 (T=!) 



aOOG almus, chain 2 (T=2.») 



BMG ilims, chain 3 (T=7.7) 




Figure 3: Simulations for chains 1 to 5 obtained by one run of the EES. The colors correspond 
to the five energy levels. 



5 Example of estimation using local Gibbs samplers moves 

We consider estimation of model parameters in case of a mixture model with known number of 
components. The classical algorithm used for this kind of problem is a Gibbs sampler. However, 
some difficulties are encountered to combine the original EES with a Gibbs sampler. Therefore, 
we compared only performances of PT and PTEEM algorithms, usi ng the well-known example 
of the Galaxy dataset (see for instance iRichardson and Green |l997j ). 



We consider independent observations y\ , • • • ,y n from k mixture components 



~^™j'/(-l^j>°'f)> i = l,...,n, 

3=1 



with k fixed and known and where f(.\fij,<jj) denotes the density of the Gaussian distribution 
N(^j,a]). The sizes of the k groups are proportional to w±,W2, ■ ■ ■ ,Wk, which are the weights 
of the components. The parameters to be estimated are the means fij, the variances cr|, and the 
weights Wj, for j = 1, . . . , k. The label of the component from which each observation is drawn 



14 





chain 1 


chain 10 


chain 20 


chain 1 


/ 1 rid 

0.00 


4.63 


0.50 


chain 2 


lb. 62 


4.33 


0.62 


chain 3 


1 A OA 

14. o4 


4.z9 


0.64 


chain 4 


11.98 


4.64 


0.70 


chain 5 


9.9b 


4.89 


0.7b 


chain 6 


8.2b 


5.4b 


i no 

1.03 


chain 7 


b.57 


5.7b 


1.17 


chain 8 


b.01 


^ o£ 
6. 2b 


1.50 


chain 9 


4.9b 


6. /4 


o no 
2.03 


chain 10 


A OO 

4.32 


0.00 


o on 
2.30 


ciiaiii 1 1 




7 1 1 
/ .11 




chain 12 


2.85 


6.67 


4.33 


chain 13 


2.42 


6.65 


5.61 


chain 14 


1.98 


6.11 


7.38 


chain 15 


1.61 


5.76 


8.75 


chain 16 


1.44 


5.13 


10.64 


chain 17 


1.15 


4.64 


13.72 


chain 18 


1.08 


4.32 


16.09 


chain 19 


0.87 


3.63 


19.10 


chain 20 


0.62 


2.99 


0.00 



Table 6: Repartition (in %) of accepted equi-energy moves between chain 1 and other possible 
chains (mean on 100 runs of PTEEM). Idem for chains 10 and 20. 



is unknown, and a label vector c which is a latent allocation vector is introduced as follows: 
Ci = j if the observation yi is drawn from the jth component. The variables Cj are supposed 
independent with distributions 

p(ci=j) = Wj, j = l,...,k. 

Write y = (yi)i=i,..., n , £* = (fJ>j)j=i,...,k, ° 2 = {<rj)j=i,...,k, w = (wj) j=lt ... )k and c = (ci) i=li ... >n . 
The [ij and aj 2 are supposed to be independent with the following priors: 

Mi ~ N (£> a" 2 ~r(a,/3) and P~T(g,h), (2) 

where /3 and h are rate parameters. The prior on w is taken as a symmetric Dirichlet distribution 

w~D(6, 5,..., 5). 

The parameters 5, £, k, a, g and h are supposed to be fixed. Let us denote by rrij = z~27=i ^-a=j 
the number of observations labeled by j. The joint posterior density, the full conditional distri- 
butions and the formula of the acceptance rate for the equi-energy move are given in Appendix 

El 
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In this example, the estimates of the parameters obtained after labeling were quite good and 
similar for the PT and PTEEM algorithms. They were even comparable to those obtained with 
a classical Gibbs sampler. The major difference between these three algorithms was the ability 
to explore the parameter space: the Gibbs sampler stayed in local modes for many successive 
iterations, while the PT and PTEEM algorithms easily jumped from one mode to another. Con- 
sequently, we focused on the label-switching phenomenon (see Jasra et al. 2005| ). and not on 
the estimation of the parameters. 



The data consist of the velocities of 82 distant galaxies diverging from our own. We fix the 
number of components to k = 6, and we took for the fixed parameters in ([2]): a = 3, £ = 20, 
S = 1, K = 1/R 2 , g = 0.2 and h = W/R 2 , where R = 10. The algorithms PT and PTEEM were 
run 100 times, each run consisting of 10000 iterations after a burn-in period of 2000 iterations. 
We used 20 chains and 5 energy rings. As in the previous example, the PT algorithm used a 
swap between neighboring temperature chains for the exchange operation, and only one swap was 
proposed at each iteration. Concerning the energy ladder, after a run of a classical Gibbs sampler 
with target density tt, we chose H\ = 180 and H§ = 260. Four energy rings were obtained with 
levels evenly spaced between H\ and H§ on a logarithmic scale, the fifth ring containing all states 
having an energy value higher than H^. The levels obtained were 180, 197.3, 216.3, 237.2 and 
260. We chose iV = 20 temperatures between 1 and 4, with their inverses evenly spaced. Table 
[7] shows for several chains the distributions of states in the energy rings. 
Clearly, the mixture posterior has k\ = 720 symmetric modes and, in theory, for a very high 
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9688 



Table 7: Distribution in the energy rings of states from 10000 iterations, for one run of PTEEM 
and for chains 1, 4, 8, 10 12, 16 and 20. 



number of iterations, the chain of interest should have visited all modes, with equal frequencies. 
Wh en the chain goes fr om one mode to another, there is the so-called label-switching phenomenon 



(see I Jasra et al.l [2005| ). Such a phenomenon is a useful convergence diagnostic to check if the 
chain of interest has explored all possible labelings of the parameters. To compare PT and 
PTEEM algorithms we considered for each run of each algorithm both the number and the 
frequency of visited modes by the target chain. Table [8] shows that on 100 runs of PTEEM the 
target chain visited more modes than on 100 runs of PT. Hence the label-switching phenomenon 
seems to occur more often during a run of PTEEM than during a run of PT. We also counted in 
each of the 100 runs for the two algorithms how many times the target chain visited each mode 
in the last 10000 iterations. The absolute frequency error is given by errj = \ fi — 1/6! |, where 
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mean 


standard deviation 


min 


max 


PT 


645.04 


13.52 


610 


683 


PTEEM 


666.52 


9.23 


641 


692 



Table 8: Means, standard deviations, minimal and maximal values of the number of visited 
modes, on 100 runs of PT and PTEEM. 



fi is the sample frequency of the ith mode being visited (i = 1, ... ,61). We then calculated the 
mean and median of this absolute frequency error over the 100 runs and the 6! modes. Absolute 
frequency errors were slightly lower for PTEEM with a mean (resp. a median) of 0.119% (resp. 
0.099%), compared to 0.126% (resp. 0.099%) for PT. 

We studied further the equi-energy moves of the algorithm PTEEM. In Table [9] it appears 
that exchange moves were more frequent between chains with similar temperatures. The mean 
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Table 9: Proportions (%) of accepted equi-energy moves between chain 1 and other possible 
chains (mean on 100 runs of PTEEM). Idem for chains 10 and 20. 

acceptance rates of the equi-energy moves for PTEEM and of the exchange moves for PT were 
of 49% and 61% respectively. Note that we could implement the PT algorithm so that exchange 
moves can be proposed between any two chains and not only between adjacent chains. But in 
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this case the mean acceptance rate of an exchange move would be much lower. In comparison the 
PTEEM algorithm has the advantage to propose exchanges moves between chains not necessarily 
adjacent, but more relevant in terms of energy levels. 



6 A complex problem: discovery of transcription factor binding 
motifs 



6.1 Model and data 

The discovery of binding motifs in order to understand gene regulation is an important topic 
in biology. Indeed, a first step to understand gene expression is to know which are the cor- 
responding binding sites of a common transcription factor (TFBS). The identification of these 
TFBS is a major computat i onal problem, often s t udied these last twenty years (see for instance 
Stormo and Hartzelll Il989 |. Lawrence and Reilly 1990 |. Lawrence et al. 1993 ] . Liu et al. 1995| 



Jensen et al. 1200 



or 

The data often consist of several homologous DNA sequences, and finding the TFBS is equivalent 
to identifying the starting positions of these sites in the sequences. Denote by S the set of M se- 
quences, each one containing zero, one or more TFBS. Each sequence is made of four nucleotides: 
A, C, G or T. The TFBS are assumed to be of known length w. The length of the rath sequence is 
L m , hence the number of possible starting positions for TFBS is denoted by = L m — (w — 1). 
The total number of motif sites is unknown and is denoted by \A\. As this number is unknown, 
the M sequences (without their w — 1 last nucleotides) are considered as one long sequence of 
length L* = Ylm=i ^m- This long sequence contains \A\ TFBS. To identify the most promising 
positions for the TFBS, we introduce a missing vector A = (a±, 0*2 ■ ■ ■ , at,*), where en = 1 if 
the ith position of the long sequence is the starting point of a TFBS, and aj = otherwise. 
Given A, the set S can be written as the union of two disjoint subsets: S(A) |J S(A C ), where 
S(A) contains the aligned motifs of the identified TFBS, S(A C ) representing the background 
sequence. Two different models are used for these two subsets . Concern i ng th e background se- 
quence, the simplest model is a product multinomial model (see lLiu et al. |l995l | ). but it has been 
shown that a Mark ov model is biologically more relevant and improves the results obtained (see 



Jensen et al.1 [20041 ] ). However, it makes the motif discovery m ore difficult, as repeated patterns 



are local modes for the algorithms. Following iKou et al 
one based on the following transition matrix 



2006] we used a Markov model of order 




a a a 
1 — 3a a a 
a a 1 — 3a 



where a = 0.12. The parameter 8q is assumed to be known (in practice it can be easily well 
estimated from the data). Concerning S(A), it can be seen as a matrix of dimensions \A\ x w, 
with the BSFT in rows. The A:th column contains the nucleotides in A;th position of the \ A\ sites. 
Let C = (Ci,..., C w ) be a count vector, where Ck = {CkA,Ckc->CkG-,CkT) is the vector of the 
nucleotides counts in position k of all the sites. The common pattern of the TFBS is modeled by 
a product multinomial distribution of parameter = (9±, 6 W ) where 6^ = (6kAi &kCi 8kG, Gkr) 
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Figure 4: WebLogo corresponding to the product multinomial used to generate data. 

is a probability vector for the preference of the nucleotide types in position k. According to 
the model, each vector Cfc has a multinomial distribution with parameter 9k independent of the 
other columns. For this example we used 



e 



/ 0.6 0.1 0.6 0,1 0.3 0.2 0.5 \ 

0.8 0.2 0.5 0.25 0.7 

0.2 0.1 0.8 0.7 0.9 0.25 0.2 

\ 0.4 0.7 0.2 0.3 0,1 0.3 0.7 0.1 0.6 0.5 0.1 / 



The corresponding WebLogo dCrooks et al.l |2004^ is given in Figure |6~T1 



To complete the model, conjugate prior distributions are considered. The distribution of 
is a product of Dirichlet with parameters . . . , j3 w ): 



7r(9)a]p 



k ' 



where 



k 



n C- 

j={A,C,G,T} 



The prior probability of a component aj of A is denoted by po, which is the "site abundance" 
parameter: 



IT {A I po) =P j4| ( 1 -P0 



\L*-\A\ 



Finally, this parameter po is assumed to follow a beta distribution Be(a, b). 

From #o an d ©; we generated M = 10 background sequences of length 200, and \A\ = 20 
TFBS of length w = 12. Two TFBS were introduced in each of the ten sequences, hence we 
obtained 10 sequences of length 224. 



6.2 Classical approach: the Gibbs sampler 

To solve the challe nging problem of ident i fying TFBS, baye sian approach es using Gibbs samplers 
were developed by Lawrence et al. 1993| . Liu et al. 1995], or Liu 1994 1. The missing vector A 
giving the starting positions of the TFBS is of interest, hence the aim is to build a Markov chain 
having the posterior distribution of A as stationnary distribution. 
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Followin g IKou et al.1 12006| . in order to obtain the posterior of interest n(A \ S), the collapsing 



technique of iLiul [1994j | is used to integrate out the unknown parameters O and po hi the joint 



posterior distribution. As a consequence these parameters are not updated at each iteration and 
the computation time is reduced. More o ver, t he us e of this technique facilita tes the convergence 

The posterior of 



of the Markov chain, as noted by ILiul 1994j | and Ivan Dvk and Park 
interest is given by 



200 



n(A | S) 



oc 



F(\A\ + a)T(L*-\A\ + b) ^ T{C k + 



ir(S(A)\A,d ) 



r(L* + a + b) 



n 



\n\A\ + \p k \y 



(3) 



with T(C k + 0k) = ]~fj-{.4 c G T) ^( Ckj + 0kj) - Usi ng ([3]), a predictive update version of the 
Gibbs sampler has been proposed by iLiu et al.l )l995l | . They suggest to update each component 
of A independently of the others using the following predictive update formula: 



1 



\A, 



+ a 



p(ai = l\Ai_{i,S) 
p(oi = 0|A hi] , S) n(S(ai) \ A, 6 ) " L* - A H -1 + 6 




no 



(4) 



where the following notations are used: Arx represents the vector A without the ith component, 
S(cii) represents the sites of A starting in position i, C k [_q = (C k [_q A , C fe[ _j ]c , C k [_q G , C fc [_ i]T ) 
is the vector of the nucleotides counts in position k of all the sites, excluding the site starting 
in position i, C k u^ = {C k u\ A i Cfc(i)C) ^k(i)Gi Cfc(i)T) i s the vector of the nucleotides counts in 



position k of the site starting in position i 

C k = C k [_i] + Cfc(i). 

6.3 EES algorithm 



(this vector contains three and one 1). We have 



The a lgorithms resulti ng from the Gibb s sampling approach, such as BioProspector (ILiu et al 



200 J ) or Align ACE (|Roth et al.l Il998ll). ar e ofte n trapped into local modes and true motif 



patterns are not found. Therefore Kou et al. 2006| proposed to use the EES algorithm, which 
seems to improve the global TFBS search. In this algorithm, K chains are used and the lih chain 
has the following target distribution 



tci(A) oc exp 



h(A) V Hi 



with h(A) = -log(n(A \ S)). 



For the target chain Kou et al. 2006| used a Gibbs sampler to generate the vector A. For the 
other chains, given the current sample A, they first estimated the common pattern by a frequency 
counting. Then they built a new v ector according to the Bayes rule, which is accepted according 



to a Metropolis-Hasting move (see lKou et al.l 2006l | for more details). 



6.4 PTEEM algorithm 

In this algorithm, chains are used and the /th chain has the following target distribution 



TT l (A\S) = n(A\S) T i, 
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and are locally updated by Gibbs samplers. Concerning the first chain, the updating of each 
component of A is done using the predictive update formula (J4|). Concerning the Ith chain 
(I > 1), the predictive update formula to be used is the following: 



p(ai = l\A[_j],S) 
p{ai = OlA^S) 



1 



+ a 



n{S(oi) | A,9 Q ) L* - A H] -1 + 6 ^ \\A hi] \ + \(3 k 



n 



k[-i] 



+ Pk 



c, 



k(i)\ T, 



(5) 



Concerning a proposed equi-energy move between two chains l\ and I2 of current states A^ and 
Ai 2 , the acceptance probability is given by: 

p = min{l, 7r ' 1 ^ 3 15 7rh ^ ll ,f? }, 



with 



nAAi 1 s)iri 2 (A h 1 s) 



TT h (Ai 2 I S)TTi 2 (A h I S) 
nAAi I S)TT h {A h I 5) 



^2 1 s ) \w;~w 

ir(A h I S); 1 

7^(^)1^,00) 



B(\A h \ + a,L* - \A h \ + b) 



7r(S(A l2 ) I A l2 , g ) B(| A h \+a,L* - \A h \ + b) 



rd^j + i^i) 



6.5 Results 

The algorithms EES and PTEEM as explained above were run 10 times each, on the data 
presented in 16.11 For each run of PTEEM, N = 15 chains were used, with a burn-in of 200 
iterations and a post-burn-in of 800 iterations, resulting in 15000 local moves and 1000 proposed 
global moves. For each run of EES, K = 9 chains were used, with p ee = 0.1, a burn-in of 200 
iterations and a post-burn-in of 800 iterations, among which 100 iterations were used to construct 
energy rings. It results in 18160 local moves and approximately 1640 global moves. 



Calibration 

Concerning PTEEM, 5 energy rings were used, with energy levels regularly spaced on a logarith- 
mic scale between 10 and 100, giving levels 10, 17, 78, 31.62, 56.23 and 100. The temperatures 
have their inverses regularly spaced between 1 and 1/1.3, giving T m j n = 1 and T max = 1.3. 
Concerning EES, 9 energy rings were used, with energy levels regularly spaced on a logarithmic 
scale between 10 and 100. The temperatures used are the following: 1, 1.001, 1.002, 1.005, 1.01, 
1.02, 1.06, 1.1 and 1.3. This choice has been made in order to permit equi-energy jumps between 
chains. For instance, fixing T\ = 1 and T2 = 1.1, no jumps would have been possible between the 
first and the second chains, because the energies of chains associated with these temperatures 
are too different. 
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Local and global moves 

Concerning the 10 runs of PTEEM, 55.71% of the proposed equi-energy moves were accepted, 
allowing a good mixing of the chains. The first chain exchanged states relatively easily with chains 
of lower orders, and it exchanged states even with chains 10 or 11. Concerning the 10 runs of 
EES, the last 8 chains were locally updated by Metropolis-Hastings algorithms: approximately 
15% of new states proposed for the second to fifth chains were accepted, but only 2.9% were 
accepted for chain 8 and 0.7% for chain 9. The proposed equi-energy jumps were mainly accepted 
(86% in mean), but it is noticeable that very few jumps were proposed between the first and 
the second chain (26 jumps in mean during the 1000 iterations). Indeed, the states of these two 
chains often had energies quite different. As an example, the second chain never obtained a state 
in the first energy ring. 



Identification of the TFBS 

Results of the 10 runs of PTEEM were quite similar, as opposition to the 10 runs of EES. Figure [5] 
represents two boxplots representing empirical posterior probabilities P(cii = 1 | S), i = 1, . . . , L* 
obtained during a run of PTEEM, and during a run of EES. Hence these boxplots represent the 
posterior probabilities of each possible position to be the starting point of a TFBS. Only the 
positions associated to high posterior probabilities are relevant. On the boxplots of figure [5] 
for instance, we decided to keep only the positions with posterior probabilities higher than 0.8. 
Concerning the 10 runs of PTEEM, they identified 16 sites among 20. Among them, 15 were 
identified with exactly the true starting positions, and 1 was identified with 3 other positions 
(positions 877, 880 and 883 were kept, the true one being 880). Concerning the 10 runs of EES, 
they identified in mean 15.6 sites among 20. Among them, 9.6 were identified with exactly the 
true starting positions, and 6 were identified with phase-shifted positions or several positions. 
For example, a site has been identified by positions 1784 and 1791 while the true one was 1784, 
and another has been identified by position 27 while the true one was 26. Notice that 5 EES 
runs among 10 obtained similar results as the PTEEM runs. 



Conclusion on these results 

Results obtained by EES could be improved with a better calibration. In particular, using 
more chains would improve the results (with a supplementary computational cost). However, 
the low number of jumps proposed between chains 1 and 2 is noticeable. It could be due to 
temperatures too far from each others, but as we used T\ = 1 and Ti = 1.001, it should be due 
to the Metropolis-Hastings algorithm used to update the second chain. This algorithm could 
have difficulties to propose relevant states. Indee d, it is not easy t o find a good proposal law for 
new states, and maybe the method proposed by Kou et al. 2006| is not the best possible. The 
difficulties encountered to calibrate the EES and the associated Metropolis-Hastings algorithms 
are a disadvantage to the use of this algorithm. In comparison, the calibration of PTEEM is 
much easier. Indeed, the Gibbs sampler does not need to be calibrated, and if temperatures 
and energy levels are well chosen, the number of accepted equi-energy moves between chains is 
sufficiently large to allow good mixing of the chains. We did not encounter difficulties to calibrate 
these parameters, suggestions of 13.31 giving good results. 

Concerning the results obtained on this challenging example, those obtained with PTEEM were 
slightly better than those obtained with EES. The mixing of the chain of interest was more 
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efficient in PTEEM. Hence, PTEEM identified exactly most of the true starting positions of 
TFBS, while EES tended to identified TFBS with several positions or phase-shifted positions. 
That means that EES was most often trapped in local phase-shift modes. 

Note that this phase-shift problem is en countered by most of the methods us ed to identify TFBS, 
and solutions have been proposed, see Liu 1994 1 or Lawrence et al. 1993| . Implementation of 



these solutions in the algorithms PTEEM or EES is absolutely possible. Similarly, improvements 



Jensen et al. 


2004 


1 and 


Liu et al. 


2001 



7 Discussion 

In this paper a new algorithm combining Parallel Tempering and Equi-Energy Sampler was 
proposed. Inspired by the original idea of the EES, it is based on the use of energy rings. Thanks 
to relevant equi-energy moves, the proposed PTEEM algorithm allows a good exploration of the 
parameter space and good mixing of the generated Markov chains, while ensuring the reversibility 
of the exchange moves. Therefore the generated Markov process theoretically converges to ir* , 
and the first chain generates samples corresponding to the distribution of interest ir. 

Compared to PT, this new algorithm has the same theoretical properties, while outperforming 
it. The drawback is that an energy ladder is needed, but we explained simple and practical ways 
to obtain a relevant ladder, which proved to be efficient. 

Compared to the original EES, this new algorithm has the advantage to be based on Monte 
Carlo Markov chains theory, which is quite simple to use and to understand, even for non- 
experimented users. M oreover, the asy mptotic variance of PTEEM is smaller or equal to those 
of the EES, as noted by Atchade 2010| which compared MCMC algorithms and adaptive MCMC 



algorithms like the EES. On a practical point of view, the PTEEM needs less storage than the 
EES, since all iterations from the past are not kept in memory. Moreover, it can be coupled with 
a Gibbs sampler, unlike the original EES because of an energy truncation. However, the EES 
could be modified to be used without energy truncation. On our examples, PTEEM gave results 
at least as good as those obtained with an EES. 

A direction for future research is to investigate further the theoretical properties of the 
PTEEM algorithm, by comparing convergence rates of PTEEM and PT algorithms for instance. 
Besides an automatic way to build the energy rings could be inspired from the approach of 



Zhou and Wongl [2008l | to reconstruct the energy landscape of the target density. Finally, an 
adaptive PTEEM algorithm to finely tune the temperatures and/or the energy levels during a 
run would also be of interest. 
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A Formula used for the comparisons in case of a Gibbs sampler 
A.l Joint posterior densities 

Write x = (fi,a~ 2 ,w,c,(3). The joint posterior density from which the parameters should be 
drawn is: 

n(x) oc p(y | n, a' 2 , c)p(fi, a~ 2 ,c, f3, w). 
Hence the ith chain should be drawn from 

7Tj(x) oc 7r(x) T i (xp{y \ fi,a~ 2 ,c) T ip(fM,a~ 2 ,w,c, f3) T ^ . 



However, as noted by Jasra et al. 2007| and Behrens et al.l 2012], tempering the whole posterior 
is problematic as there is no guarantee that the tempered posterior will remain proper. As a 
consequence, only the likelihood contribution is tempered and the priors are left untempered. 
The ith chain is then drawn from 

j_ 

7r-(x) oc p(y | x) T *-p(x). 



A. 2 Full conditional distributions 



Concerning the ith. chain, the full conditional distributions to be used in the Gibbs sampler of 
the algorithms are easily obtained through conjugacy. We use the following notations: 



x i = (fi i ,a i ,Wi,Ci,Pi), m 

-2 / -2 -2 -2\ 

Ci = (cn,c i2 , . . . ,c in ), mi = (m il ,m i2 , . . . ,m ik ), 



(jJ>il,(H2> ■ ■ ■ , Ikk), 
(Wil,W i2 , . . . ,w ik ) 
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with p = l,...,k index of component and I = 1, . . . , n index of observation. For /j>i, a i and w. 
the full conditional distributions are the following 



l:c a =p 

\2 



m ip a , ^ (M - ^pf 



2Ti ' ^ 2Ti 

l-.Cii=p 

Wi | Ci,S ~ £>(<5 + ma, 5 + m i2 , • • • ,5 + m ifc ). 

For the allocation vector c, the full conditional distribution is multinomial with the following 
probabilities: 

/ i -2 s 1 / {vi- ^ip) 2 \ 

Pi{Cii =p | y,Hn°i i w i) —exp[ 2o-2 — )Wip. 



//J 



ip 

The parameter /3j has the following full conditional distribution: 



k 

Pi I cr~ 2 , a,g,h~T(g + ka,h + ^2 a if) ■ 

P =i 



A. 3 Acceptance rate of an equi-energy move 



Assuming that two chains i and j are selected from an energy ring to be swapped, the acceptance 
probability of an equi-energy move proposed between two chains is given by 



mm 1 



j-foA - m m(-\ ( p(yM_ Y 1/T i- 1/Ti) 
u Xj ) mm v>v P (y\ Xj )) 



where 



ft n ( \2 

p(y\x) oc \{a p ^)r m -exp(-Y,^ 



- 1=1 2 < 



B Proofs of Proposition 13.11 and Lemma 13.11 
B.l Proof of Proposition 13.11 

During an iteration of the PTEEM algorithm all chains are locally updated by a MCMC algo- 
rithm and an exchange move is proposed. By assumption, PLi(., .) is reversible with stationary 
distribution 7Tj. It is then clear that PL = Y\ i=1 PLi is also reversible. Let A £ B{X) N , which 
can be written as A\ x A2 x . . . x An, with Ai 6 X . We have 

it* {A) = [ PL(s,A)Tr*(ds), 
Jx N 
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which implies that tt* is the stationary distribution of PL(., .). Then, the transition kernel PE 
can be written as 



PE(s,s')=q(s,s')p(s,s')+ / q(s,s")(l- p(s,ds"))t W} (s). 

J X 

A sufficient condition to satisfy the detailed balance condition is the following: 

q(s,ds)p(s,s)ir*(ds) = q(s',ds)p(s f ,s)iT*(ds'). 



(6) 



(7) 



In the PTEEM algorithm, the two candidate chains to exchange their actual states are chosen 
uniformly among all chains in the same energy ring. Hence we have q(s,s') = q(s',s). Using 
(p}, it follows that ([7]) is satisfied, and the detailed balance condition holds. Therefore the 
transition kernel PE for the equi-energy move is reversible, with stationary distribution tt*. The 
transition kernels PE and PL are reversible with stationary distribution tt* . It is then clear 
that P is also reversible and that tt* is its stationary distribution. In addition, each PLi is 
supposed to be strongly A-irreducible and aperiodic, hence PL is aperiodic and strongly Atv- 
irreducible. Since PE is just an exchange kernel between two actual states it is cle ar that P = 
PE*PL is also strongly ATv-hreducible and aperiodic. Theorem 1 of lTiernevi [1994l | then allows 
to conclude. DAn 



autom atic way to build the energy rings could be inspired from the approach of IZhou and Wong 
|2008| to reconstruct the energy landscape of the target density. A run of PTEEM is make with a 
large number of rings and the distribution of samples in the rings is observed to deduce the larger 
and lower values of the energy. Then there are two possibilities: either the obtained distribution 
is studied and some rings are grouped to get some relevant rings, i.e. having about the same 
number of simulations, else a tree of sublevel sets is built as in Zhou and Wong (2008) and at 
least one ring for each local minimum of energy and one ring for each local maximum of energy 
are chosen. It is probably longer than a well calibrated run of PTEEM. 

B.2 Proof of Lemma I3JH 

From Proposition ^. 11 S is reversible with stationary distribution tt*, and strongly Ajv-irreducible. 
It follows that S is positive. Note that a state s' reached from a starting point s after an iter- 
ation of PTEEM can not be part of a set A E X N such that tt*(A) = (proof inspired from 
Roberts and Rosenthal! [2006l |. Theorem 8). 



To show that S is Harris-recurrent we use Theorem 2 of iTierneyl |l994l | that characterizes 
Harris- recurrent chains as follows: a Markov chain is Harris-recurrent if and only if the only 
bounded functions h satisfying 



E(h(S^)\s ) = E(h(sW)\s ) = h(s ), 



Vn E N, 



(8) 



ar e the constant functions. F unctions h sati sfying fl8l) are called h armonic. We use Theorem 6.80 
of lRobert and Casellal [20O4|], inspired from lAthreya et al.l |l996l | as follows: 



If the transition kernel P satisfies: 3B E B(X) N such that 

(i) Vsq, X^nLi Ib -^ >n ( s 0) s)dp(s) > 0, with p the initial distribution of the chain. 
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(ii) inf SiS / eB P(s, s') > 



Then, for 7r*-almost all sq, 



lim sup 

n ^°°AeB(X) N 



[ P n (s ,s)ds- [ Tr*(s)ds\ = 0, 

J A J A 



(9) 



To apply this result, notice that Assumptions (i) and (ii) are verified for B = X . Equation 
is then satisfied for 7r* -almost all sq. 



Using 



AeB(X) 

this equation ([9]) can be written as 



lAWv = sup |^(^4)| = - sup | / h(x)n(dx 
AeB(X) N 2 \h\<i J 



lim sup 

n -*°°\h\<i 



E[h(S n )\s ] - E„.[h(s)] 



We can extend this result for all bounded function h. Moreover, if h bounded satisfies ([8]), then 
E[h(S n )\so] = h(so). We then have h(so) = E^* [h(s)] for 7r*-almost all sq, and h is 7r*-almost ev- 
er ywhere constant and equal to E^* (h(S)). Analysis similar to that in the proof of Theorem 6.80 
of Robert and Casellal 2004 1 shows that h is everywhere constant and equal to E 7r *(/i(S')). The 
Harris-recurrence then follows. 



□ 
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Posterior probabilities of possible binding sites (PTEEM) Posterior probabilities of possible binding sites (EES) 




Figure 5: representing posterior probabilities P{ai = 1 | S),i = 1, . . . , L* , obtained during a run 
of PTEEM, and during a run of EES. 
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